The dataset, HappyDB, is a corpus of 100,000 crowd-sourced happy moments. The goal of the corpus is to advance the state of the art in understanding the causes of happiness that can be gleaned from text.

Environment settings for this project.

R.version
               _                           
platform       aarch64-apple-darwin20      
arch           aarch64                     
os             darwin20                    
system         aarch64, darwin20           
status                                     
major          4                           
minor          2.2                         
year           2022                        
month          10                          
day            31                          
svn rev        83211                       
language       R                           
version.string R version 4.2.2 (2022-10-31)
nickname       Innocent and Trusting       

Step 1: Data processing and mining (from Text_processing.rmd)

Step 0 - Load all the required libraries

From the packages’ descriptions:

  • tm is a framework for text mining applications within R;
  • tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures;
  • tidytext allows text mining using ‘dplyr’, ‘ggplot2’, and other tidy tools;
  • DT provides an R interface to the JavaScript library DataTables.
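
The chunk that loads these packages is hidden in the rendered output. A minimal sketch of that step, assuming all four packages listed above are already installed, might look like this:

```r
# Load the text-mining, tidy-data, and table-display packages described above.
# Assumes they have been installed beforehand, e.g. with install.packages().
library(tm)
library(tidytext)
library(tidyverse)
library(DT)
```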

Step 1 - Load the data to be cleaned and processed
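
The loading chunk is also hidden in the rendered output. The sketch below uses the URL recorded in the notebook's source chunk; the helper name read_moments and the two demo rows are hypothetical and only illustrate the expected shape of the data (a cleaned_hm text column).

```r
library(readr)

# URL of the cleaned HappyDB moments file, as recorded in the notebook source
urlfile <- "https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv"

# read_moments() is a hypothetical helper name, not part of the notebook
read_moments <- function(path = urlfile) {
  read_csv(path, show_col_types = FALSE)
}

# hm_data <- read_moments()  # full download; requires network access

# Offline stand-in with made-up rows, only to show the expected columns
demo_path <- tempfile(fileext = ".csv")
writeLines(c("hmid,cleaned_hm",
             "1,I bought 2 new books today!",
             "2,My dog learned a new trick."), demo_path)
demo <- read_moments(demo_path)
```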

Step 2 - Preliminary cleaning of text

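Clean the text by converting all letters to lower case and removing punctuation, numbers, and extra white space. The removeWords call is given an empty word list (character(0)), so it should leave the text unchanged at this stage; stopwords are handled in Steps 5 and 6.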

corpus <- VCorpus(VectorSource(hm_data$cleaned_hm))%>%
  tm_map(content_transformer(tolower))%>%
  tm_map(removePunctuation)%>%
  tm_map(removeNumbers)%>%
  tm_map(removeWords, character(0))%>%
  tm_map(stripWhitespace)
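
To see what these transformations do, the same cleaning steps can be tried on a single made-up sentence. This is only an illustrative sketch: the toy sentence is an assumption, and the removeWords call with an empty list is left out.

```r
library(tm)
library(dplyr)  # provides the %>% pipe

# One made-up moment with capitals, a digit, punctuation, and extra spaces
toy <- c("I Bought 2 new books,   today!")

toy_corpus <- VCorpus(VectorSource(toy)) %>%
  tm_map(content_transformer(tolower)) %>%  # lower-case every letter
  tm_map(removePunctuation) %>%             # drop "," and "!"
  tm_map(removeNumbers) %>%                 # drop the digit
  tm_map(stripWhitespace)                   # collapse runs of spaces

# Extract the cleaned text of the first (and only) document
cleaned <- content(toy_corpus[[1]])
print(cleaned)
```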

Step 3 - Stemming words and converting tm object to tidy object

Stemming reduces each word to its stem, so that inflected forms of the same word are counted together. We stem the words here and then convert the “tm” object to a “tidy” object for much faster processing.

stemmed <- tm_map(corpus, stemDocument) %>%
  tidy() %>%
  select(text)
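
As a small illustration of stemming on its own, stemDocument can also be applied directly to a character vector. The example words are assumptions chosen for illustration; stemDocument relies on the SnowballC package being installed.

```r
library(tm)  # stemDocument() uses the SnowballC stemmer internally

# A few made-up words in inflected form
original <- c("running", "played", "books")

# Reduce each word to its stem
stems <- stemDocument(original)
print(stems)
```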

Step 4 - Creating tidy format of the dictionary to be used for completing stems

We also need a dictionary of the unstemmed words, one token per row, so that each stem can later be mapped back to a full word.

dict <- tidy(corpus) %>%
  select(text) %>%
  unnest_tokens(dictionary, text)
Warning: Outer names are only allowed for unnamed scalar atomic inputs
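
The tokenisation itself can be sketched on a toy tibble. The two sentences here are made up; the point is only that unnest_tokens turns one row per moment into one row per word.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two made-up, already-cleaned moments
toy <- tibble(text = c("i bought new books", "my dog learned a trick"))

# One row per word, stored in a column named "dictionary"
toy_dict <- toy %>%
  unnest_tokens(dictionary, text)

print(toy_dict)
```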

Step 5 - Removing stopwords that don’t hold any significant information for our data set

We start from the stopword list provided by the “tidytext” package and extend it with custom stopwords specific to our data. The combined list is applied in Step 6.

data("stop_words")

word <- c("happy","ago","yesterday","lot","today","months","month",
                 "happier","happiest","last","week","past")

stop_words <- stop_words %>%
  bind_rows(mutate(tibble(word), lexicon = "updated"))
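
A minimal sketch of how such an extended list filters tokens, assuming a made-up token column and a shortened custom list (the names custom_stop_words, tokens, and kept are hypothetical):

```r
library(dplyr)
library(tibble)
library(tidytext)

data("stop_words")

# Shortened custom list, for illustration only
word <- c("happy", "today")

custom_stop_words <- stop_words %>%
  bind_rows(mutate(tibble(word), lexicon = "updated"))

# Made-up tokens: a custom stopword, a standard stopword, and two content words
tokens <- tibble(dictionary = c("happy", "the", "dog", "birthday"))

# Keep only the tokens that do not appear in the stopword list
kept <- anti_join(tokens, custom_stop_words, by = c("dictionary" = "word"))
print(kept)
```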

Step 6 - Combining stems and dictionary into the same tibble

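Here we combine the stems and the dictionary into the same “tidy” object. Because bind_cols pairs rows by position, this relies on the stemmed text and the dictionary producing the same number of tokens in the same order. Stopwords are then removed by matching on the unstemmed dictionary word.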

completed <- stemmed %>%
  mutate(id = row_number()) %>%
  unnest_tokens(stems, text) %>%
  bind_cols(dict) %>%
  anti_join(stop_words, by = c("dictionary" = "word"))
Warning: Outer names are only allowed for unnamed scalar atomic inputs
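
Since the pairing is positional, it can be worth checking that both token tables have the same length before binding them. The sketch below does this on toy data; the tibbles and the explicit check are assumptions added for illustration and are not part of the notebook.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up stemmed moments and their unstemmed counterparts
stemmed_toy <- tibble(text = c("i bought new book", "my dog learn a trick"))
dict_toy <- tibble(text = c("i bought new books", "my dog learned a trick")) %>%
  unnest_tokens(dictionary, text)

# Tokenise the stems, remembering which moment each token came from
stem_tokens <- stemmed_toy %>%
  mutate(id = row_number()) %>%
  unnest_tokens(stems, text)

# Guard: positional binding is only meaningful if the lengths agree
stopifnot(nrow(stem_tokens) == nrow(dict_toy))

paired <- bind_cols(stem_tokens, dict_toy)
print(paired)
```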

Step 7 - Stem completion

Next, we complete each stem by picking the dictionary word that occurs most frequently for it.

completed <- completed %>%
  group_by(stems) %>%
  count(dictionary) %>%
  mutate(word = dictionary[which.max(n)]) %>%
  ungroup() %>%
  select(stems, word) %>%
  distinct() %>%
  right_join(completed, by = "stems") %>%
  select(-stems)
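
The frequency rule can be illustrated on a toy table in which the stem "book" appears twice as "books" and once as "book". The data and the intermediate name lookup are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Made-up paired tokens: three occurrences of the stem "book"
toy <- tibble(
  id = c(1, 2, 3),
  stems = c("book", "book", "book"),
  dictionary = c("books", "book", "books")
)

# For each stem, keep the dictionary word with the highest count
lookup <- toy %>%
  group_by(stems) %>%
  count(dictionary) %>%
  mutate(word = dictionary[which.max(n)]) %>%
  ungroup() %>%
  select(stems, word) %>%
  distinct()

# Attach the chosen word to every token, then drop the stem column
result <- right_join(lookup, toy, by = "stems") %>%
  select(-stems)
print(result)
```

In this toy case every token ends up with the word "books", because that form is the most frequent one for its stem. When two forms are tied, which.max returns the first of them.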

Step 8 - Pasting stem completed individual words into their respective happy moments

We want the processed words to resemble the structure of the original happy moments, so we paste the words belonging to each moment back together into a single string.

completed <- completed %>%
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " ")) %>%
  ungroup()
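
A small sketch of this "reverse unnest" on made-up tokens, using the same group_by, summarise, and str_c pattern:

```r
library(dplyr)
library(stringr)
library(tibble)

# Made-up completed tokens for two moments
toy <- tibble(
  id = c(1, 1, 2, 2, 2),
  word = c("bought", "books", "dog", "learned", "trick")
)

# Collapse the words of each moment into one space-separated string
moments <- toy %>%
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " ")) %>%
  ungroup()

print(moments)
```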

Step 9 - Keeping track of the happy moments with their own ID

hm_data <- hm_data %>%
  mutate(id = row_number()) %>%
  inner_join(completed, by = "id")

datatable(hm_data)
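
One consequence of using inner_join here is that a moment whose words were all removed as stopwords has no row in completed and is therefore dropped from the result. A toy sketch of that behaviour, with made-up data and hypothetical names:

```r
library(dplyr)
library(tibble)

# Three made-up moments; the second consists only of a stopword
hm_toy <- tibble(cleaned_hm = c("I bought new books.",
                                "Today!",
                                "My dog learned a trick.")) %>%
  mutate(id = row_number())

# Processed text exists only for moments 1 and 3
completed_toy <- tibble(id = c(1L, 3L),
                        text = c("bought books", "dog learned trick"))

# inner_join keeps only ids present in both tables
joined <- inner_join(hm_toy, completed_toy, by = "id")
print(joined)
```

If such moments should be kept with a missing text value instead, left_join would be the alternative to consider.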

Exporting the processed text data into a CSV file

write_csv(hm_data, "../output/processed_moments.csv")
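
The export assumes that the ../output directory already exists; write_csv will typically fail if it does not. A hedged sketch that creates the directory first, using a temporary directory and made-up rows as stand-ins for the real path and data:

```r
library(readr)

# Stand-in for "../output"; a temporary directory keeps the example harmless
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_file <- file.path(out_dir, "processed_moments.csv")

# Made-up processed rows, standing in for hm_data
toy <- data.frame(id = 1:2, text = c("bought books", "dog learned trick"))
write_csv(toy, out_file)

# Read the file back to confirm the export round-trips
round_trip <- read_csv(out_file, show_col_types = FALSE)
```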

The final processed data is ready to be used for any kind of analysis.
